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Abstract 

Graphic Processing Units (GPUs) are getting increasingly important as target architectures in scientific High Per- 
formance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture controlling and making 
use of the compute power of their GPUs. CUDA provides sufficient support for C++ language elements to enable the 
Expression Template (ET) technique in the device memory domain. 

QDP++ is a C++ vector class library suited for quantum field theory which provides vector data types and expres- 
sions and forms the basis of the lattice QCD software suite Chroma. 

In this work accelerating QDP++ expression evaluation to a GPU was successfully implemented leveraging the ET 
technique and using Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API 
for CUDA kernel arguments were used to build the bridge between host and device memory domains. This provides 
the possibility to accelerate Chroma routines to a GPU which are typically not subject to special optimisation. As 
an application example a smearing routine was accelerated to execute on a GPU. A significant speed-up compared to 
normal CPU execution could be measured. 
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1. Introduction 

GPUs are getting increasingly important in scientific 
HPC. Massively multicore architectures supported by 
high bandwidth memory buses make them an attractive 
target architecture for either floating point operation rich 
or memory access intensive applications. 

NIVIDA established CUDA 1 1] as their pai-allel com- 
puting architecture. It enables HPC scientists to dramat- 
ically increase the computing performance by taking ad- 
vantage of the compute power of GPUs. While former 
releases of CUDA mostly supported a C-like Applica- 
tion Programming Interface (API) with little support for 
C++ features, the upcoming release 4.0 supports C++ 
features in a much broader way attracting an even larger 
set of applications and libraries to be made subject to 
multicore acceleration. 

Active libraries typically implemented utilising meta- 
programming methods have proven to provide domain- 
specific abstractions to the user and on the other hand to 
provide compilers with sufficient freedom to optimise 
the code to a satisfactory level. They combine the ben- 
efits of built-in language abstractions (convenient API, 
translation to efficient code) with those of library-level 
abstractions (information hiding, adaptability) |2, 3|. 



Quantum Chromodynamics (QCD) is the theory of 
the strong force between gluons and quarks. The for- 
mulation of the theory on a discrete space-time lattice is 
called lattice QCD and has opened the path to numerical 
calculations. Lattice QCD has received much attention 
as a "grand challenge" problem in scientific HPC. Cur- 
rent lattice calculations demand computational work of 
sustained peta-flops and compute resources have been 
and continue to be the main limiting factor to large scale 
lattice QCD calculations. 

Although heavily dependent on the simulation pa- 
rameters typically the largest portion of the compute 
time in lattice QCD calculations is spent solving a sys- 
tem of linear equations when inverting the so called 
fermion matrix. Typically most of the work invested 
in optimisation of lattice QCD applications is spent on 
this part leaving the remaining parts of the calculation 
fairly unoptimised. 

GPUs form an attractive platform upon which to de- 
ploy large scale lattice QCD calculations [4|. To date a 
lot of effort has focussed on optimising the solver part 
of lattice QCD applications |l5l|6l|7l[8l. Outstanding im- 
plementations of several inverters which achieve a very 
high sustained performance using a mixed precision ap- 
proach combined with reliable updates became avail- 



January 20, 2013 



able |l9l[T0l. Seamless integration for these solvers are 
provided for a number of lattice QCD software suites 
including Chroma which builds on top of QDP++ [ llj . 

However, as these solvers optimised for the CUDA 
architecture provide for a significant speed-up of the 
inversion of the fermion matrix, the remaining (unop- 
timised) parts of the calculation start to dominate the 
overall execution time. 

Currently GPU enabled software packages typically 
have some short comings: A particular part of the cal- 
culation is either executed on the GPU or the CPU leav- 
ing the respective other system mostly idle. This is not 
ideal taking into account the roughly equal acquisition 
cost and power consumption of both systems. 

On the other hand heterogeneous computing archi- 
tectures offer the possibility to enable a cooperative 
computing environment where the general purpose and 
specialised processors are working together in an in- 
terleaved fashion. The idea is to split the calculation 
into several small tasks and to deploy multiple types of 
processing elements within a single workflow each as- 
signed to the task its best suited for. 

In order to install this "fine-grained" structure in 
Chroma acceleration is directly implemented in the un- 
derlying library QDP++ assigning the code parts to ex- 
ecute on the processor element it is best suited for, i.e. 
lO-intensive operations on the CPU, floating-point rich 
operations on the GPU. In this way not only a finite set 
of selected functions are executed on the accelerator but 
acceleration is applied in a more general fashion. 

Benchmark measurements confirmed that solely ac- 
celerating the floating-point rich operations yet yields 
a significant speedup factor compared to unaccelerated 
execution. 

The data layout, the order of access, the precision of 
the primitives are left unchanged just as prescribed by 
the QDP-H-i- standard template order 

Section [2] introduces briefly the CUDA architecture. 
The ET technique is briefly outlined in section [3] Sec- 
tion |4] introduces QDP+-I-. Section [5]introduces the de- 
sign elements introduced to QDP+-I- for accelerating the 
evaluation. Section[6]details on the new QDP-I-+ API el- 
ements. An application example is detailed in section]?] 
Section|8]details on benchmarking results. 

2. The CUDA Architecture 

The CUDA architecture is built around a set of multi- 
threaded Streaming Multiprocessors (SM) which pro- 
vide the main compute power Threads are organised in 
a hierarchical manner: A kernel grid is a collection of 



thread blocks. A thread block is a collection of threads 
and represents an indivisible unit allocatable by a mul- 
tiprocessor Whether a given thread block can be al- 
located by a SM depends on the resources (number of 
registers and shared memory) its threads collectively re- 
quire and the resources available on the SM. The threads 
of a thread block execute concurrently on one SM, and 
multiple thread blocks can execute concurrently on one 
SM (one block active at a time). 

The Streaming Multiprocessors implement the 
Single-Instruction, Multiple-Thread (SIMT) architec- 
ture. Threads resident on one SM are bundled into 
groups of 32 parallel threads, so called warps. The 
threads of exactly one warp are executed in SIMT fash- 
ion. Individual threads composing a warp start together 
at the same program address, but they feature their own 
instruction register and are free to branch independently. 
However, in this case execution is serialised. 

The SM is able to hide latency to device memory by 
switching execution to a different warp whose instruc- 
tion is ready to execute. It is therefore beneficial to or- 
ganise the thread number per SM in such a way that a 
suflicient number of warps is resident to (ideally) com- 
pletely hide the latency to device memory. 

3. Expression Templates 

C-I--H function and class templates together with func- 
tion and operator overloading offer the possibility to 
represent expressions as C-I-+ types. This technique is 
commonly referred to as Expression Templates (ETs) 
and was first introduced by Todd Veldhuizen lfT2l and 
David Vandevoorde. 

ETs provide a means to eliminate the need for creat- 
ing temporaries when implementing a C-H-i- vector class 
library which features both a convenient API such as 
for domain specific languages desired and a high perfor- 
mance of the translated code. However, the latter feature 
relies on the optimisation abilities of the compiler and 
is not ensured in all ET applications, e.g. for Basic Lin- 
ear Algebra Subprograms (BLAS) Level 2 and 3. Here 
different, but also ET based approaches may achieve a 
better performance |13 |. 

The Portable Expression Template Engine (PETE) pi- 
oneered the use of the ET technique for parallel physics 
computations ||3l. It is an extensible implementation 
of the ET technique and achieves an exceptional level 
of abstraction without sacrificing performance and pro- 
vides the core functionality (on the vector level) of 

QDP-HH-. 

Implementation of a vector library utilising the ET 
technique typically involves defining a template of the 
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evaluate function. Upon assignment of an expression 
the compiler generates an instantiation of the evaluate 
function matching the expression. The evaluate function 
then typically implements a loop iterating over all vector 
components - no temporary vector objects are required. 

C++ compilers offer the possibility to access the 
function template's arguments in a fully instantiated 
C++ type in form of a C string - so called pretty print- 
ing, which provides a means to access at runtime the 
expressions. 

4. QDP++ 

QCD Data Parallel (QDP++) is a C++ vector class h- 
brary suited for quantum field theory. It forms the basis 
for the widely used lattice QCD software suite Chroma 
and as such provides the lattice wide data types and ex- 
pressions used in Chroma 1 11 1. Chroma implements lin- 
ear algebra operations which may include nearest neigh- 
bour communications utilising the QDP++ API. 

Although not designed originally for multicore accel- 
eration, this work demonstrates that design elements can 
be added to QDP++ in such a way that evaluations of 
arbitrary expressions are accelerated and executed on a 
GPU. This approach was previously established when 
deploying lattice QCD applications to the QPACE su- 
percomputer L14i J5J . 

5. Accelerating QDP++ evaluation on a GPU 

Accelerating QDP++ expressions on a GPU relies on 
leveraging the ET technique to the device memory do- 
main. A crucial component for ETs to work in a par- 
ticular memory domain is (besides the compiler's abil- 
ity to handle C++ templates) the ability to take mem- 
ory addresses of functions and allowing for dynamically 
dereferencing function addresses (function pointers). 

The new release 4.0 of CUDA on devices with com- 
pute capability no less than 2.0 meets the requirements 
for the ET technique to work on the device memory do- 
main. 

Unfortunately CUDA provides only a C-interface to 
kernel functions making it impossible to directly pass 
C++ constructed expressions as arguments to compute 
kernels. 

This work circumvents the aforementioned lack of a 
C++ API to kernel arguments by first constructing in 
device memory domain an object of an equivalent C++ 
expression type as used during host code translation 
and second deploying the missing runtime configurable 
Plain Old Data (POD) parts by copying those from the 
host side expression object. 



In order to construct the object in device memory do- 
main a JIT compilation of CUDA kernel modules is 
triggered upon expression evaluation. Launching this 
kernel constructs the required object in device memory. 
Still the runtime configurable parts are missing. 

To build the bridge between the two expression ob- 
jects residing in different memory domains the POD 
part of the expression object on host side is copied into 
a C API compatible form which is allowed to be passed 
as CUDA kernel arguments. 

5.7. Dynamic Code Translation, Just-In-Time Compila- 
tion 

Entering the evaluation function triggers pretty print- 
ing of its arguments and executing an external code 
generator which generates C++ device code leverag- 
ing the ET technique. The NVIDIA Compiler NVCC 
is invoked which builds a shared library containing the 
CUDA kernel. 

The shared library is loaded via the dynamic linking 
loader and the kernel is executed on the device. 

After evaluation the shared library is kept loaded until 
the application exits. This ensures that each kernel func- 
tion is only generated once and subsequent calls branch 
to the already loaded shared library. 

5.2. Flattening the Expression Tree 

PETE 13] provides means to traverse the expression 
tree and execute custom operations on the tree nodes 
and leaves. This method is used on host side to collect 
the runtime configurable data (POD portion) of each op- 
erator and to store them into a linear storage container 
that can be passed (as a pointer) to the CUDA kernel as 
an argument, i.e. the expression tree is flattened. 

The inverse operation restores the expression tree on 
device sid^ 

Certain runtime configurable operators require spe- 
cial treatment. E.g. the shift operation requires read- 
only access to a site table index initialised at runtime. 

A device storage container was added in such a way 
that the first call to a particular runtime configurable op- 
eration triggers copying of the required data to device 
memory. The storage container keeps track of already 
transferred memory regions and subsequent calls to the 
same operation do not trigger the device copy again but 
make usage of the akeady resident data in device mem- 
ory. The site tables remain in device memory until they 
are explicitly freed by the user This speeds up repeated 
calls to the same shift function. 



'For unknown reasons the NVIDIA Frontend++ traverses the ex- 
pression tree in a mirrored sense compared to the GNU C++ Com- 
piler. 
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5.3. Mixed Memory Domain Approach 

Since device memory is (still) a scarce resource a 
mixed memory domain approach was favoured: Mem- 
ory allocation for lattice wide objects utilise the host 
memory domain. Upon user request the object's data 
is pushed to the device memory domain. 

A new feature coming with version 4.0 of CUDA pro- 
vides the possibility to page-lock a memory range that 
was already allocated (4kB aligned) in the host memory 
domain and to add it to the tracking mechanism to auto- 
matically accelerate calls to device copy functions. This 
mechanism eliminates the previously required staging 
of data regions prior to the transfer to device memory 
and reduces pressure on host memory. 

5.4. Thread Geometry 

The evaluation function template in ET based vector 
libraries typically triggers execution of a loop iterating 
over all lattice sites. With CUDA, parallelisation of a 
loop is typically carried out unrolling the loop and start- 
ing one thread per loop iteration. 

Since here the applied CUDA kernels not only consist 
of processing the lattice sites but also require prior re- 
construction of the expression tree it was not clear that 
the typical approach leads to the best performance. A 
software configuration parameter A^site was introduced 
which specifies the number of sites assigned to one 
thread. 

CUDA kernel functions are launched with specifying 
the grid and block geometries. Thus a software configu- 
ration parameter Mhreads is introduced that specifies the 
number of threads per block. 

Given the total number of lattice sites the grid geom- 
etry is a function of threads and A^site- 

For each expression £, a separate CUDA kernel is 
generated. Thus the sustained performance f is a func- 
tion of the number of threads per block, the number of 
lattice sites processed per thread, and the expression E,: 

P(Nthre^ds,N,its,Ei). 

CUDA enabled software packages might be equipped 
with an auto-tuning mechanism that determines the op- 
timal grid and block geometries for the particular set of 
installed devices. Auto-tuning is run prior to production 
and the geometry parameters that yield the best perfor- 
mance are stored for later inclusion. 



Listing 1: Modified Ciiroma implementation for Jacobi smearing. 
Prior to any calculation the lattice wide objects are pushed to the de- 
vice (first darker grey shaded region). After calculation the result ob- 
ject is copied back to host memory and device memory is freed (sec- 
ond shaded region). QDP++ expressions (line number): £o(16. 20), 
£i(25),£'2(27),£3(30),£4(33). 
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void j acobiSmear ( const 
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const RealSc kappa, int 


iter , int 
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T psi ; 
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Real norm; 
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T s_0 , h_smear ; 
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psi .pushToDeviceO ; 
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forCint mu = 0; mu < Nd; 


++mu ) 



u[mu] .pushToDeviceO ; 
chi .pushToDeviceO ; 
h.smear . pushToDevice ; 
s_0 .pushToDevice () ; 

s_0 = chi ; 

for(int n = 0; n < iter; ++n) 
{ 

psi = chi ; 

bool first = true; 

for (int mu = 0; mu < Nd; ++mu ) 

{ 

if (first) 

h_smear = u [mu] * shift (psi , FORWARD 
, mu) + shif t ( adj (u [mu] ) *psi , 
BACKWARD, mu); 

else 

h_smear += u [mu] * shif t (psi , FORWARD 
, mu) + shif t ( adj ( u [mu] ) *psi , 
BACKWARD, mu); 

f irst = false ; 

} 

chi = s_0 + kappa * h_smear; 

} 

chi /= _norm ; 

chi .popFromDeviceO ; 

for(int mu = 0; mu < Nd; ++mu ) 

u[mu] . f reeDeviceMemO ; 
psi . f reeDeviceMemO ; 
h.smear . f reeDeviceMemO ; 
s_0 . f reeDeviceMemO ; 



6. New QDP++ API Elements 

The QDP-n- API was extended by the following ele- 
ments: 
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• OLattice: :pushToLattice() 

Allocates a memory region of the object's size in 
device memory and copies the object's data to the 
device. 

• OLattice: :popFromLattice() 

Copies the data from the device to host memory 
and frees device memory. 

• OLattice :: freeDeviceMemO 
Frees device memory. 

• theDeviceStorage : : f reeAll 

Frees device memory previously allocated for run- 
time configurable operators. 

7. Application Example: Jacobi Smearing 

Frequently used Chroma routines which are typically 
not subject to special optimisation and which can con- 
sume a significant amount of (wallclock) time when ex- 
ecuted on a few CPU cores only are quark smearing rou- 
tines. These are operations acting on lattice wide ob- 
jects and are typically iterative prescriptions including 
nearest neighbour communications. One of these rou- 
tines implements Jacobi smearing [16] and serves here 
for the benchmarking analysis. 

The Jacobi smearing procedure is obtained by solving 
the Klein-Gordon equation 

K(x,x')S(x',0)^6_,,o (1) 

where 

K,,x' = Sjcx-Ks ^ Uf,{x)6x',x+^ + Ul(x-iS)6jc'^x-^ (2) 
as a power series in ks stopping at some finite power 

^ smear • 

Listing. [T] shows the Chroma implementation of Ja- 
cobi smearing using the QDP+-I- API . Five QDP-H-i- ex- 
pressions are involved: Efj, with < ju < 4 where the 
most compute intensive ones are £1 and £2. 

The code lines shaded darker grey were introduced 
to enable execution on the GPU. All lattice objects are 
pushed to the device. After calculation the device mem- 
ory is freed and the result objects copied back to its orig- 
inal location in host memory. 

8. Benchmark Results 

Benchmarking analyses were carried out using a 
NVIDIA GeForce GTX 480. This device has 1.5 GB of 
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Table 1: Benchmark results for the Jacobi smearing routine executed 
on the CPU and the GPU. Numbers in sustained GFLOPS for the 
whole smearing routine. '''Double precision throughput on this de- 
vice is restricted. 
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Figure 1 : Performance dependence of expression Ei on A'threads for 2 
L Ja = 32 lattice (single precision). 

memory, compute capability 2.0, 15 Streaming Multi- 
processors and a total of 480 CUDA cores. Double pre- 
cision on this device is restricted as it utilises the GFIOO 
chip designed for the consumer market. The NVIDIA 
CUDA 4.0 toolkit (Release Candidate 2) was used with 
the NVIDIA Linux kernel driver version 270.40. 

The Chroma Jacobi smearing routine was applied to 
lattice objects where lattice sizes ranged from N - x 
16 to - 32^ X 64 in single precision and from - 
8^ X 16 to = 24^ X 48 in double precision. 

The first benchmark analysis studied the performance 
^'(Mhreads,A^site,£') for Varying threads keeping £ = £2 
and A^site = 1,2,4,8 fixed for a 32^ x 64 lattice in single 
precision. Fig.[T|shows the result. The best performance 
was achieved when setting the thread block size equal to 
the warp size Mhreads - 32. 

The second benchmark analysis focused on the per- 
formance dependence on A^site keeping E - E2 and 
Mhreads = 32, 64, 128, 256 fixed for a 32^ x 64 lattice in 
single precision. Fig.|2]shows the result. Only a moder- 
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Figure 2: Performance dependence of expression £2 on Wjitc for a 
L^la = 32 lattice (single precision). 
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Figure 3: Benchmark result for Jacobi smearing on a NVIDIA 
GeForce GTX 480 in comparison to the Intel Xeon CPU. The number 
of lattice sites is given hy N = 2(Li/a)'*. 



ate dependence was seen on this configuration parame- 
ter. Although when configuring for 32 threads a slightly 
better performance is achieved when using A^site = 4 or 
8 instead of A^site = 1 ■ 

These benchmark analyses were repeated for the ex- 
pressions E = Ei), El, E},, E4. The same characteristics 
were observed: Setting the parameter Mhreads = 32 re- 
sulted in all cases clearly to the best performance with 
only a moderate dependence on A^site- 

A final benchmark analysis was carried out for the 
whole smearing routine (including all five expressions) 
for different lattice sizes (single and double precision) 
in comparison to executing the same routine on the host 
CPU, an Intel Xeon CPU (E5507, 4 cores, 2.27GHz). 
Tab. [T] shows the benchmark results in numbers (sus- 



tained GFLOPS for single and double precision). Fig.|3] 
shows the result graphically. For the 32^ x 64 lattice in 
single precision a speedup factor (compared to the CPU) 
of more than 14 was measured. 

9. Conclusion, Outlook and Discussion 

As a first step, QDP+-I- expression evaluation was ac- 
celerated on the GPU by leveraging the ET technique 
on the device memory domain. Solely with acceleration 
a significant speedup factor for the evaluation could be 
achieved. 

Providing an auto-tuning mechanism for the ap- 
proach described here is not straight forward since the 
individual expressions £, are not known prior to pro- 
duction. Establishing an auto-tuning mechanism forms 
part of the to-do list. However, as the benchmark mea- 
surements showed there seems to be a preferred thread 
geometry that leads to the best performance. 

Optimisation would be a next major step. By chang- 
ing the data layout in such a way that coalescing device 
memory access takes place an even higher speedup fac- 
tor is expected. 

A next step in another direction would be paralleli- 
sation to multiple GPUs per host and targeting for the 
parallel architecture of QDP++ to extend this approach 
to multiple hosts. 

One might also want to move the shared linking 
loader to an external daemon program or the system 
loader. In this way the JIT compilation takes place only 
once - even across several Chroma runs. 

Utilising QDP++ profiling would eliminate the de- 
pendence on an external code generator. 

Software Availability 

QDP-1--1- and Chroma are available as open source 
software ifTTll . QDP-(--i- configurable for GPU evalua- 
tion is available fTTj. 

The GPU portion of QDP-n- requires the upcom- 
ing release 4.0 of NIVIDA CUDA |[T1 and devices with 
compute capability lo less than 2.0 are required. 
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